Add tuned HIP GiMMiK preload-C and width variants with non-temporal loads and stores by tomjen12 · Pull Request #19 · PyFR/GiMMiK

tomjen12 · 2026-06-18T05:44:03Z

Summary

Adds tuned HIP GiMMiK preload and vector-width variants for PyFR-style matrix multiplication cases.

The new variants include C-preload kernels for cstream/bstream and aligned width-2 preload variants. Existing and new HIP templates now use shared non-temporal load/store helpers for C accesses.

Results

On MI300X PyFR GiMMiK matmul benchmarks across 53 double-precision p2-p5 cases, bandwidth efficiency versus a 4.4 TB/s target improved from:

Baseline: 76% min, 84% geomean
This branch: 86% min, 94% geomean

Companion PyFR PR: PyFR/PyFR#567

Test plan

Ran PyFR HIP GiMMiK matmul correctness checks against NumPy reference results.
Ran MI300X double-precision p2-p5 benchmark cases.

tomjen12 · 2026-06-18T06:20:18Z

Hi @FreddieWitherden , could you please review this PR when you have a chance?
I also noticed I cannot request reviewers from the sidebar. Is reviewer assignment limited to repository members?

FreddieWitherden · 2026-06-18T12:56:39Z

+          if A[j, kx] != 0:
+              nzixs.append((l_idx, kx))
+
+      has_dotp = any(A[j, kx] != 0 for kx in range(k))


As A is a numpy array this can likely be simplified as A[j].any() (I think!). Same for the above for where we can eliminate the explicit Python loop with something built upon A[j, kbx].

FreddieWitherden · 2026-06-18T12:57:47Z

+
+% if width == 2:
+static inline __device__ ${dtype}
+gimmik_vmul(${dtype[:-1]} a, ${dtype} b)


Might be worth moving these up into a shared file which other kernels can include via Mako. Keeps all of the vector stuff in one place.

FreddieWitherden · 2026-06-18T13:05:49Z

Overall looks good. Few quick things:

Confirm nothing breaks on the consumer SKU's. Performance is not critical here but some of our users use RDNA GPUs for local development and wave32 vs 64 has tripped us up in the past.
As we're now emitting more kernels this can have a negative impact on the start-up time of PyFR when they are JIT'ed for the first time. If you have reasonable heuristics it might be adopting the approach of PTX Backend #18 which has JSON files with tuning info (and am working on that PR to get the code moved into the base class so all generators can share it). PyFR passes in the wave_size and GCN arch name (queried from HIP) to GiMMiK which can be used for dispatch. This is especially useful if a certain trick always yields a benefit.

FreddieWitherden · 2026-06-18T14:59:02Z

Couple of other notes (mostly for me):

Confirm compatibility with PyFR v3.1 which is the latest released version. This only uses the static n kernel variants so shouldn't have any issues with wide width kernels (since GiMMiK provides the exact block sizes).
Confirm no regressions on MI250X which is widely used due to its deployment on supercomputers in the US and Europe. It is fine if performance doesn't improve, but we want to avoid regressions.

tomjen12 · 2026-06-22T08:55:21Z

Thanks for the review and advice.
Updates made based on your comments:

Simplified the NumPy row checks using array operations, following the suggestion around A[j].any() and avoiding explicit Python loops where possible.
Moved the vector helper code into a shared Mako include so the width/vector logic is kept in one place and reused by the HIP kernels.
Confirmed compatibility with PyFR v3.1. The optimized HIP variants pass the correctness checks with PyFR v3.1.
Confirmed MI250X (gfx90a) correctness as well. The 22 variants pass accuracy checks on the tested gfx90a cases, and performance also looks improved overall.

For dispatch, I am currently gating the tuned HIP variants using the HIP-reported GCN arch and wave size:

base_arch = gcn_arch.split(':', 1)[0] if gcn_arch else None
if base_arch not in {'gfx90a', 'gfx942'} or warp_size != 64:
    return

This enables the tuned variants for validated gfx90a and gfx942 wave64 targets while avoiding other architectures for now.

RDNA / consumer GPU testing is still pending. I am currently looking for access to a suitable machine to verify wave32 behavior before enabling anything there.

FreddieWitherden · 2026-06-22T14:26:46Z

Thanks!

Going through the PR there appear to be three main improvements.

Preloading Current GiMMiK kernels attempt to load parts of $C$ on demand, whereas the PR also considers doing the loads all upfront. If this is found to always be beneficial then we probably can eliminate the on-demand paths. (The original thoughts around incremental loading were to get some extra ILP, but I don't think I ever considered preloading.)
Non-temporal loads/stores These are used in some guise on NVIDIA already. Again, if we find they are always beneficial we should drop support for the non-temporal path. If they're neutral we can probably go either way. If there are cases where they help and others where they regress we should see if there is a simple heuristic to guide us. For example, on x86 CPUs we use non-temporal stores when the matrix is above a certain size (32 MiB if I recall). Should we keep both it is worth unifying the kernels. This can be done through either inline device functions, classic C preprocessor macros, or passing Python lamdba's to the Mako templates and evaluating them to yield the correct syntax. All of these approaches are fine.
Wider kernels This is something we used to have on HIP but it was removed in 07220ea Is there any reason this path can not be brought back as-is? The benefit is that it shares a kernel template with the current code.

All in, if I've understood everything correctly I do not think we need any new templates, just extra parameterization on the templates we have. This tends to make things more manageable and easier to understand so it would be preferable. Again, it is possible I've missed something about the new kernels which really does require a different overall structure, so let me know there.

FreddieWitherden · 2026-06-22T21:07:55Z

Have just run the PR through my benchmark suite (p = 2 to 5, double precision, ~4 GiB of data so we always asymptote) and find the overall performance to be about the same (geomean ratio PR/base: 1.003x). Did your benchmarking also include the rocBLAS kernels? It could be that the wins are coming from kernels where rocBLAS does better than either the new or old GiMMiK kernels.

tomjen12 · 2026-06-23T01:20:44Z

Thanks for checking this.

Could you share a bit more detail about your benchmark setup?

What platform/GPU are you using for the test?
What ROCm version or Docker image are you using?
Could you share the benchmark script or command line you used? I can run the same setup on my side for comparison.
Also, the new HIP variants are currently gated on validated wave64 targets:

base_arch = gcn_arch.split(':', 1)[0] if gcn_arch else None
if base_arch not in {'gfx90a', 'gfx942'} or warp_size != 64:
    return

Could you confirm whether your test platform reports gfx90a or gfx942 with warp_size == 64? If not, the new tuned variants may not be emitted. As a quick check, could you also try temporarily disabling this arch gate and rerun the benchmark?

tomjen12 · 2026-06-23T01:22:19Z

Regarding the template structure, I will also take another look and see how much can be merged back into the existing HIP templates.

tomjen12 · 2026-06-23T10:11:52Z

Thanks for the detailed feedback.

I have updated the HIP kernels so that width is now a template parameter rather than requiring separate width-specific templates. This lets the existing templates generate the wider variants directly.

For the non-temporal C load/store path, I tested the four basic HIP GiMMiK variants over the 53 PyFR matrix cases. Across 212 per-variant comparisons, the non-temporal path wins 190 cases and gives a +10.51% geomean bandwidth improvement. There are a few per-variant regressions, but for the affected cases other variants remain non-regressing, so the autotuner should still be able to avoid a loss in the final selected kernel. Based on this, I think using the non-temporal C path by default is justified.

For preload-C, the result is different. It is not uniformly beneficial: preload-C wins 83/212 comparisons, while the on-demand path wins 129/212, with an overall geomean bandwidth speedup of 0.94x. Some cases do improve substantially, so I think preload-C is still useful to keep as additional autotune candidates, but it should not replace the on-demand variants outright.

FreddieWitherden · 2026-06-23T20:33:22Z

Could you confirm whether your test platform reports gfx90a or gfx942 with warp_size == 64? If not, the new tuned variants may not be emitted. As a quick check, could you also try temporarily disabling this arch gate and rerun the benchmark?

It was a caching issue on my end. I can now reproduce your results on an MI300X w/ROCm 7.14.

FreddieWitherden · 2026-06-23T20:44:50Z

Thanks for the detailed feedback.

I have updated the HIP kernels so that width is now a template parameter rather than requiring separate width-specific templates. This lets the existing templates generate the wider variants directly.

Thank you for investigating. Assuming non-temporal loads/stores are widely available (so compiles for older CDNA and RDNA) I think the best course of action is to enable them everywhere all the time for HIP. This simplifies the code and puts us at parity with CUDA where we also ask for quasi-non temporal stores.

In terms of preloading I agree with making this an option in the auto-tuner. Extra kernel variations are not as big of a deal for HIP since the kernels are generic on n whereas on CUDA the sizes are baked into the code so we pay more of a hit.

Only other thing to check is that we yield kernels in a sensible order. To keep start-up times down PyFR can be instructed to only consider the first m kernels from GiMMiK. Hence, it is worth seeing if we can yield them in a sensible-ish order (or better still, if there are combos which never seem to win out, just prune them on our end). No need to strive for perfection here, it is more a "nice to have" rather than something essential.

(The long term goal is to make things adaptive; PyFR when it compiles a kernel passes information about the register use and spillage back to GiMMiK. At the moment this is discarded, but could be useful in the future for things like "don't bother with a width = 2 kernel if width = 1 spills".)

Finally, while it is not a priority at the moment, can you check there are no major regressions single precision.

tomjen12 · 2026-06-24T09:58:56Z

Thanks.

I've removed the nt_c option for HIP. Non-temporal loads/stores for C are now always enabled, which simplifies the HIP path.

I also checked single precision and do not see any major regressions. For the 53 GiMMiK-accessible cases tested, the geometric mean is about a 4% improvement, with the worst case at 98% of the previous baseline, which is likely within run-to-run noise.

For the kernel yield order, I agree this is worth looking at. My colleague is planning to push a follow-up commit with the non-temporal B optimization. Once that lands, I can use the updated tuning data to pick a better order.

On memory-bound operators the B matrix is read once from HBM and reused only within a work-group via LDS -- it is never re-read across blocks. A normal global load still allocates B's line in L2, which is pure overhead: the line is never reused, it evicts genuinely-reusable data, and it adds cache-allocate/eviction traffic. This is the read-side mirror of the non-temporal C store we already use. NTB loads B with a non-temporal load (load_b -> __builtin_nontemporal path) so B bypasses L2 instead of polluting it. It moves the same number of bytes but keeps the cache clean, raising effective bandwidth. Implemented as a flag on the existing templates rather than new files: - base.mako: add a load_b() wrapper (non-temporal B load). - bstream-msplit{,-preload-c}.mako: gate the B read behind an `ntload` flag (context.get('ntload', False)); renders byte-identically to the plain kernel when the flag is absent. - hip.py: emit `*-ntb` variants by passing ntload=True inside the existing width loop, so NTB combines with width (w1/w2) automatically. Backward-compatible (plain variants unchanged) and CDNA-gated like the other tuned variants. On MI300X (gfx942) NTB passes the accuracy check (~1e-15) and wins the autotune in the majority of memory-bound cases (~+4.5% bandwidth on those), being chosen over the plain bstream-msplit.

FreddieWitherden · 2026-06-24T14:52:47Z

@@ -0,0 +1,104 @@
+<%inherit file='base'/>


Can we merge this into the msplit kernel which preload as an option using % if/else as appropriate to switch between the two?

Done. I merged the preload-C path into the existing msplit template behind a preload option, and removed the separate preload-C template.

FreddieWitherden · 2026-06-24T15:27:58Z

 { return 0; }
 % endif

+% if width == 1:


Can we define overloads like the CUDA backend does:

https://github.com/PyFR/GiMMiK/blob/master/gimmik/kernels/cuda/base.mako#L13

Just keeps the code a little cleaner and more consistent.

Done. I updated the HIP base template to use vector operator overloads in the same style as CUDA. I omitted operator+= since HIP already provides it for vector types and defining it here caused an overload ambiguity.

Thanks! Is the += a runtime issue that should be reported upstream? Seems odd to provide an overloaded for += but nothing else. (If I recall, earlier versions of ROCm provided a full suite of overloads, but that caused issues with CUDA code compatibility where none are provided, so newer versions switch to not providing overloads. Did += sneak through?)

Reduce the HIP tuned kernel search space from 28 variants to 12 and order the remaining variants to try common winners earlier.

tomjen12 · 2026-06-25T07:19:27Z

I've pushed another round of cleanup based on the review feedback:

Non-temporal B loads are now the default. On MI300X, across the 53-case benchmark set, this moves the geomean from ~94% to ~97% of the 4.4 TB/s reference bandwidth.
The HIP templates now use natural vector operators, while avoiding custom operator+= overloads due to ambiguity with the HIP runtime definitions.
The preload-C variants have been merged into the base templates via a preload option.
The tuned HIP search space has been reduced from 28 variants to 12, and the remaining variants are ordered so the commonly winning candidates are tried earlier. With this smaller set, the MI300X benchmark geomean remains at ~97% of the 4.4 TB/s reference bandwidth.

tomjen12 · 2026-06-25T11:30:43Z

Added back bstream-msplit/m4-b24-x64 and cstream-ksplit/k2-c24-x64 because MI355 benchmarks show these baseline variants are still competitive and sometimes faster than the newer tuned variants. Keeping them in the candidate list gives autotune the option to select them when appropriate.

tomjen12 added 2 commits June 18, 2026 00:06

Add tuned HIP GiMMiK preload variants

8e23d63

Fix HIP GiMMiK block size metadata

8f4d03e

tomjen12 marked this pull request as draft June 18, 2026 05:47

tomjen12 marked this pull request as ready for review June 18, 2026 05:48

tomjen12 mentioned this pull request Jun 18, 2026

Support HIP GiMMiK vector-width metadata PyFR/PyFR#567

Open

FreddieWitherden reviewed Jun 18, 2026

View reviewed changes

FreddieWitherden self-assigned this Jun 18, 2026

tomjen12 added 3 commits June 21, 2026 23:18

Address HIP GiMMiK review comments

96671a6

Handle ROCm feature suffixes for gfx942 tuning

739a82e

Enable tuned HIP variants on gfx90a

0633539

Parameterize HIP vector width and refine preload kernels

7b59fb0

Use blockx launch bounds for HIP cstream preload

e9b921a

Always use non-temporal C accesses for HIP

2aa2577

FreddieWitherden reviewed Jun 24, 2026

View reviewed changes

Use non-temporal B loads by default for HIP

280e948

tomjen12 added 4 commits June 24, 2026 23:00

Make HIP preload-C a template option

c06216d

Avoid HIP vector operator+= overloads

e014e4d

Prune HIP tuned variants to 12

f6bc308

Reduce the HIP tuned kernel search space from 28 variants to 12 and order the remaining variants to try common winners earlier.

Remove HIP variant arch gate

a3aee45

Restore MI355 HIP baseline variants

2c7af9b

Uh oh!

Conversation

tomjen12 commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Results

Test plan

Uh oh!

tomjen12 commented Jun 18, 2026

Uh oh!

FreddieWitherden Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

FreddieWitherden Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

FreddieWitherden commented Jun 18, 2026

Uh oh!

FreddieWitherden commented Jun 18, 2026

Uh oh!

tomjen12 commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

FreddieWitherden commented Jun 22, 2026

Uh oh!

FreddieWitherden commented Jun 22, 2026

Uh oh!

tomjen12 commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tomjen12 commented Jun 23, 2026

Uh oh!

tomjen12 commented Jun 23, 2026

Uh oh!

FreddieWitherden commented Jun 23, 2026

Uh oh!

FreddieWitherden commented Jun 23, 2026

Uh oh!

tomjen12 commented Jun 24, 2026

Uh oh!

FreddieWitherden Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

tomjen12 Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

FreddieWitherden Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

tomjen12 Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

FreddieWitherden Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

tomjen12 commented Jun 25, 2026

Uh oh!

tomjen12 commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

tomjen12 commented Jun 18, 2026 •

edited

Loading

tomjen12 commented Jun 22, 2026 •

edited

Loading

tomjen12 commented Jun 23, 2026 •

edited

Loading